ABSTRACT



Context. Baryonic Acoustic Oscillations (BAO) and their effects on the matter power spectrum can be studied by using the Lyman-α
absorption signature of the matter density field along quasar (QSO) lines of sight. A measurement sufficiently accurate to provide
useful cosmological constraints requires the observation of ~$10^5$ quasars in the redshift range 2.2 < z < 3.5 over ~8000 deg$^2$. Such
a survey is planned by the Baryon Oscillation Spectroscopic Survey (BOSS) project of the Sloan Digital Sky Survey (SDSS-III).
Aims. One of the challenges for this project is to build from five-band imaging data a list of targets that contains the largest number of 
quasars in the required redshift range. In practice, one needs a stellar rejection of more than two orders of magnitude with a selection 
efficiency for quasars better than 50% up to magnitudes as large as g ~ 22. Standard methods to identify quasars using colors work 
well for brighter quasars in the range 0.3 < z < 2.2 and g < 21 but it is necessary to develop new methods for higher redshifts and 
magnitudes. 

Methods. To obtain an appropriate target list and estimate quasar redshifts, we have developed an Artificial Neural Network (NN)
with a multilayer perceptron architecture. The input variables are photometric measurements, i.e. the object magnitudes and their
errors in the five bands (ugriz) of the SDSS photometry. The NN developed for target selection provides a continuous output variable
ranging from 0 for non-quasar point-like objects to 1 for quasars. A second NN estimates the QSO redshift z using the photometric
information.

Results. For target selection, we achieve a non-quasar point-like object rejection of 99.6% and 98.5% for a quasar efficiency of, 
respectively, 50% and 85%. The photometric redshift precision is of the order of 0.1 over the region relevant for BAO studies. These 
statistical methods, developed in the context of the BOSS project, can easily be extended to any quasar selection and/or determination 
of their photometric redshift. 

Key words. Quasars - Redshift - Neural Network 



1. Introduction 



Since the first quasar was discovered (Schmidt 1963), methods
have been developed to differentiate these rare objects from other
astronomical sources in the sky. In the standard methods, it is
assumed that QSOs have point-like morphology. They are then
separated from the much more numerous stars by their photometric
colors. The UVX selection, e.g. Croom et al. (2001), can
be largely complete (>90%) for QSOs with 0.3 < z < 2.2 but this
completeness drops at higher redshift. The selection purity was
brought up to 97% for g < 21 using Kernel Density Estimation
techniques applied to SDSS colors (Richards et al. 2004) and
extended to the infrared by Richards et al. (2009a), implying that
spectroscopy is not needed to confirm the corresponding statistical
sample of quasars at high galactic latitudes. This led to the
definition of a one-million-QSO catalog (Richards et al. 2009b)
down to i = 21.3 from the photometry of SDSS Data Release 6
(Adelman-McCarthy et al. 2008).

Extending quasar selection methods to higher redshifts and
magnitudes presents several difficulties. For example, at fainter
magnitudes, galaxies start to contaminate "point-like" photometric
catalogs both because of increasing photometric errors
and because of non-negligible contributions of AGNs in certain
bands. Nevertheless, such an extension is very desirable, not
only to study the AGN population but also to use the quasars to
study the foreground absorbers. In particular, studies of spatial
correlations in the IGM from the Lyman-α forest and/or metal
absorption lines are in need of higher target density at high
redshift (Petitjean 1997; Nusser & Haehnelt 1999; Pichon et al.
2001; Caucci et al. 2008).

More recently, it was realized that the Baryonic Acoustic
Oscillations (BAO) could be detected in the Lyman-α forest.
BAO in the pre-recombination Universe imprint features in the
matter power spectrum that have led to important constraints
on the cosmological parameters. So far, BAO effects have been
seen using galaxies of redshift z < 0.4 to sample the matter
density (Eisenstein et al. 2005; Cole et al. 2005; Percival et al.
2009). The Baryon Oscillation Spectroscopic Survey (BOSS)
(Schlegel, White & Eisenstein 2009) of the Sloan Digital Sky
Survey (SDSS-III) (SDSS-III Coll. 2008) proposes to extend
these studies using galaxies of higher redshifts, z < 0.9.
The BOSS project will also study BAO effects in the range
2.2 < z < 3.5 using Lyman-α absorption towards high redshift
quasars (QSOs) to sample the matter density as proposed
by McDonald & Eisenstein (2007).

The power spectrum has already been measured at z ~
2.5 via the 1-dimensional matter power spectrum derived from
quasar spectra (Croft et al. 1999). The observation of BAO effects
will require a full 3-dimensional sampling of the matter
density, requiring a much higher number of quasars than previously
available. BOSS aims to study around 100,000 QSOs over
8,000 square degrees. The requirement that the Lyman-α absorption
fall in the range of the BOSS spectrograph requires that the
quasars be in the redshift range 2.2 < z < 3.5.

Fig. 1. 2D distributions of colors (u - g, g - r, r - i, i - z and g - i) for objects classified as PLO in SDSS photometric catalog (blue
lines for contours) and for objects spectroscopically classified as QSO (red solid lines for contours). The PSF magnitudes (ugriz)
have been corrected for Galactic extinction according to the model of Schlegel et al. (1998).

The quasars to be targeted must be chosen using only avail- 
able photometric information, mostly from the SDSS-I point- 
source catalog. The target selection method must be able to re- 
ject the non-quasar point-like objects (PLOs; mainly stars) by 
more than two orders of magnitude with a selection efficiency of 
QSOs better than 50%. The BOSS project needs a high density 
of z > 2.2 fainter QSOs (~ 20 QSOs per sq degree) and therefore 
requires the selection to be pushed up to g ~ 22. We developed 
a new method to select quasars using more information than the 
standard color selection methods. 

The classification of objects is a task that is generally per- 
formed by applying cuts on various distributions which distin- 
guish signal objects from background objects. This approach 
is not optimal because all the information (the shapes of the 
variable distributions, the correlations between the variables) is 
not exploited and this leads to a loss in classification efficiency. 
Statistical methods based on multivariate analysis have been de- 
veloped to tackle this kind of problem. For historical reasons 
these methods have been focused on linear problems which are 
easily tractable. In order to deal with nonlinearities, Artificial
Neural Networks (NN) have been shown to be a powerful tool in
the classification task (see for instance Bishop 1995).

By combining photometric measurements such as the magnitude
values and their errors for the five bands (ugriz) of SDSS
photometry, a NN approach will allow us both to select the QSO
candidates and to predict their redshift. Similar methods such as
Kernel Density Estimation (KDE) (Richards et al. 2004, 2009b)
already exist to select QSOs. Our approach based on NN is an
extension of these methods because we will use more information
(errors and absolute magnitude g instead of only colors,
i.e. differences between two magnitudes). Moreover, we propose to
treat in parallel the determination of the redshift with the same
tool. This approach contrasts with the usual methods to compute
photometric redshifts, which rely on χ² minimization techniques
(Richards et al. 2001; Weinstein et al. 2004).

2. QSO and Background Samples 

The quasar candidates should be selected among a photometric
catalog of objects including real quasars and what we will call
background objects. Here, both for the background and QSO
samples, the photometric information comes from the SDSS-DR7
imaging database of point-like objects, PLOs (Abazajian 2009).
We apply the same quality cuts on the photometry for the
two samples and select objects with g magnitude in the range
18 < g < 22. Note that in the following, magnitudes will be
point spread function (PSF) magnitudes (Lupton et al. 1999) in
the SDSS pseudo-AB magnitude system (Oke & Gunn 1983).

Fig. 2. Distributions of the discriminating variables used as input in the NN for objects classified as PLO in SDSS photometric
catalog (blue dotted histogram) and for objects spectroscopically classified as QSO (red slashed histogram): a) Distribution of the
PSF g magnitude, b), c), d), e) and f) Distributions of, respectively, σ(u), σ(g), σ(r), σ(i) and σ(z), the errors on the corresponding
PSF magnitudes.

2.1. Background Sample 

For the background sample, we would ideally use an unbiased 
sample of spectroscopically confirmed SDSS point-like objects 
that are not QSOs. Unfortunately, we have no unbiased sam- 
ple of such objects because spectroscopic targets were chosen 
in SDSS-I to favor particular types of objects. Fortunately, the 
number of QSOs among PLOs is sufficiently small that using all 
PLOs as background does not affect the NN's ability to identify 
QSOs. We have verified that this strategy works by using the 
synthetic PLO catalog of Fan (1999). We degraded the star sample
by adding a few percent of QSOs to it. We then retrained the
NN and compared it with the NN trained on a pure star sample.
We did not observe any significant worsening of the NN performance.

The background sample used in the following was drawn 
from the SDSS PLO sample. We used objects with galactic lati- 
tude b around 45° to average the effect of Galactic extinction. In 
the future, we may consider the possibility of having a different 
NN for each stripe of constant galactic latitude. The final sample 
contains 30,000 PLOs: half of them constituting the "training" 
sample, the other half the "control" sample, as explained in the 
next section. 
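The split into "training" and "control" halves can be sketched as follows. This is only an illustration: the text does not say how objects are assigned to each half, so the random shuffle, the seed and the helper name `split_half` are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_half(objects: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle a sample and cut it into two halves of equal size.

    Assumption: a seeded random shuffle; the source only states that
    half of the sample is used for training and half for control.
    """
    shuffled = list(objects)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# 30,000 PLOs, identified here by a placeholder integer index.
training, control = split_half(range(30000))
```

The two halves are disjoint by construction, which is what makes the control sample usable as an independent check on the training.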

2.2. QSO Sample 

For the QSO training sample, we use a list of 122,818
spectroscopically-confirmed quasars obtained from the 2QZ
quasar catalog (Croom et al. 2004), the SDSS-2dF LRG and
QSO Survey (2SLAQ) (Croom et al. 2009), and the SDSS-DR7
spectroscopic database (Abazajian 2009). These quasars have
redshifts in the range 0.05 < z < 5.0 and g magnitudes in the
range 18 < g < 22 (galactic extinction corrected). Since quasars
will be observed over a limited blue wavelength range (down
to about 3700 Å), we will target only quasars with z > 2.2.
For target selection, we therefore keep the 33,918 known QSOs
with z > 1.8: half of them constitute the effective "training"
sample, the other half the "control" sample. For the determination
of the photometric redshift, we use a wider sample of 95,266
QSOs with z > 1.

In order to compare QSOs with background objects
from different regions of the sky, the QSO magnitudes
have been corrected for Galactic extinction with the model of
Schlegel et al. (1998).

2.3. Discriminating variables 

The photometric information is extracted from the SDSS-DR7
imaging database (Abazajian 2009). The 10 elementary variables
are the PSF magnitudes for the 5 SDSS bands (ugriz) and
their errors. As explained in Richards et al. (2009b), the most
powerful variables are the four usual colors (u - g, g - r, r - i, i - z)
which combine the PSF magnitudes. Fig. 1 shows the 2D color-color
distributions for the QSO and PLO samples.

These plots give the impression that it is easy to disentangle
the two classes of objects but one needs to keep in mind
that the final goal is to obtain a 50% efficiency for QSOs with
a non-quasar PLO efficiency of the order of $10^{-3}$. Therefore,
to improve the NN performance, we added the absolute magnitude
g and the five magnitude errors. Their distributions for
the two classes are given in Fig. 2. An improvement can be expected
from the additional variables and also from the correlations
between the variables. For example, errors are expected
to be larger for compact galaxies than for intrinsically
point-like objects.
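As an illustration of the 10 discriminating variables (four colors, the g magnitude and the five magnitude errors), the input vector for one object could be assembled as below. The function name `build_inputs`, the ordering of the variables and the example magnitudes are assumptions made for this sketch, not values from the SDSS catalog.

```python
from typing import Dict, List

BANDS = ["u", "g", "r", "i", "z"]


def build_inputs(mag: Dict[str, float], err: Dict[str, float]) -> List[float]:
    """Return [u-g, g-r, r-i, i-z, g, sigma_u, sigma_g, sigma_r, sigma_i, sigma_z].

    `mag` holds extinction-corrected PSF magnitudes, `err` their errors.
    """
    # Four colors: differences between adjacent bands.
    colors = [mag[a] - mag[b] for a, b in zip(BANDS[:-1], BANDS[1:])]
    errors = [err[b] for b in BANDS]
    return colors + [mag["g"]] + errors


# Made-up photometry for a single faint object (illustrative only).
example_mag = {"u": 21.9, "g": 21.0, "r": 20.8, "i": 20.7, "z": 20.5}
example_err = {"u": 0.20, "g": 0.04, "r": 0.04, "i": 0.05, "z": 0.15}
features = build_inputs(example_mag, example_err)
```

Dropping the last five entries, or the last six, gives the inputs of a colors-plus-g or a colors-only network.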

Note that the g distribution for the QSOs is likely to be bi- 
ased by the spectroscopic selection. This issue will be addressed 
in the future with the first observations of BOSS. Indeed the pho- 
tometric selection of QSOs for these first observations is based 
on loose selection criteria and it should provide a "less biased" 
catalog of spectroscopically confirmed quasars, close to com- 
pleteness up to g = 22. 



3. Neural Network Approach 

The basic building block of the NN architecture$^1$ is a processing
element called a neuron. The NN architecture used in this study
is illustrated in Fig. 3, where each neuron is placed on one of four
"layers", with $N_l$ neurons in layer $l$, $l = 1, 2, 3, 4$. The output of
each neuron on the first (input) layer is one of the $N_1$ variables
defining an object, e.g. magnitudes, colors and uncertainties. The
inputs of neurons on subsequent layers ($l = 2, 3, 4$) are the $N_{l-1}$
outputs of the previous layer, i.e. the $x_j^{l-1}$, $j = 1, .., N_{l-1}$. The
inputs of any neuron are first linearly combined according to
"weights", $w_{ij}^l$, and "offsets", $\theta_j^l$:

$$ y_j^l = \sum_{i=1}^{N_{l-1}} w_{ij}^l \, x_i^{l-1} + \theta_j^l , \qquad l \geq 2 . \qquad (1) $$

The output of neuron j on layer l is then defined by the non-linear
function

$$ x_j^l = \frac{1}{1 + \exp(-y_j^l)} , \qquad 2 \leq l \leq 3 . \qquad (2) $$

The fourth layer has only one neuron giving an output $y_{NN} \equiv y_1^4$,
reflecting the likelihood that the object defined by the $N_1$ input
variables is a QSO.

1 For this study, both for target selection and redshift determination,
we use a C++ package, TMultiLayerPerceptron, developed in the ROOT
environment (Brun et al. 1995).

Fig. 3. Schematic representation of the Neural Network used
here with $N_1$ input variables, two hidden layers and one output
neuron.

Certain aspects of the NN procedure, especially the number
of layers and the number of nodes per layer, are somewhat arbitrary
and are chosen by experience and for simplicity. On the
other hand, the weights and offsets must be optimized so that the
NN output, $y_{NN}$, correctly reflects the probability that an input
object is a QSO. The NN must therefore be "trained" with a set
of objects that are known to be QSOs or not QSOs (background
objects). More precisely, the weights and offsets are determined
by minimizing the "error" function

$$ E = \frac{1}{2n} \sum_{p=1}^{n} \left( y_{NN}(p) - y(p) \right)^2 , \qquad (3) $$

where the sum is over n objects, p, and where y(p) is a discrete
value defined as y(p) = 1 (resp. y(p) = 0) if the object p is a QSO
(resp. is not a QSO). In the case of the NN developed to estimate
a photometric redshift, the targeted value y(p) is a continuous
value equal to the true spectrometric redshift, $z_{spectro}$. Note that
in the NN architecture used for this study, the activation function,
defined in Eq. 2, is not applied to the last neuron, allowing the
output variable to vary in a range wider than [0; 1].
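The forward pass and the error function can be written out in a few lines. This is a sketch, not the TMultiLayerPerceptron implementation: it assumes weights stored as nested lists indexed `w[i][j]` (input i, neuron j), a sigmoid on the two hidden layers only, and a 1/(2n) normalization of the summed squared differences. All function names and the toy weights are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def layer(x: Sequence[float], w: Matrix, theta: Sequence[float],
          activate: bool) -> List[float]:
    """Linear combination of the inputs, optionally followed by a sigmoid."""
    y = [sum(w[i][j] * x[i] for i in range(len(x))) + theta[j]
         for j in range(len(theta))]
    if activate:
        return [1.0 / (1.0 + math.exp(-v)) for v in y]
    return y


def nn_output(x: Sequence[float], weights: List[Matrix],
              offsets: List[List[float]]) -> float:
    """Propagate inputs through all layers; the last layer stays linear."""
    values = list(x)
    last = len(weights) - 1
    for k, (w, theta) in enumerate(zip(weights, offsets)):
        values = layer(values, w, theta, activate=(k < last))
    return values[0]


def error(outputs: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared difference between NN outputs and targets, halved."""
    n = len(outputs)
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / (2.0 * n)


# Toy network: 2 inputs, two hidden layers of 2 neurons, 1 output neuron.
zeros = [[0.0, 0.0], [0.0, 0.0]]
weights = [zeros, zeros, [[1.0], [1.0]]]
offsets = [[0.0, 0.0], [0.0, 0.0], [0.0]]
y_nn = nn_output([1.0, 2.0], weights, offsets)  # hidden outputs are 0.5 each
e = error([1.0, 0.0], [1.0, 1.0])
```

With all hidden weights at zero, each hidden neuron outputs 0.5, so the linear output neuron returns 0.5 + 0.5 = 1, and the error for the two toy objects is (0 + 1) / 4 = 0.25.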

In this kind of classification analysis, the major risk is the
"over-training" of the NN. It occurs when the NN has too many
parameters ($w_{ij}$ and $\theta_j$) determined by too few training objects.
Over-training leads to an apparent increase in the classification
efficiency because the NN learns by heart the objects in
the training sample. To prevent such a behavior, the QSO and
background samples are split into two independent sub-samples,
called "training" and "control" samples. The determination of
the NN parameters ($w_{ij}$ and $\theta_j$) is obtained by minimizing the
error E, computed over the QSO and background training samples.
The minimization is suspended as soon as the error for the control
samples stops decreasing even if the error is still decreasing
for the training samples. We have followed this procedure both
for the target selection and the determination of the photometric
redshift.

Fig. 4. a) NN output for objects classified as PLO in the SDSS photometric catalog, i.e. background objects, (blue dotted histogram)
and for objects spectroscopically classified as QSO (red slashed histogram) in the control samples, using 10 discriminating variables:
4 colors, g magnitude and errors on the five (u, g, r, i and z) magnitudes. b) PLO efficiency as a function of the QSO efficiency for
three NN configurations. Blue dashed line: 4 colors (u - g, g - r, r - i, i - z). Black dotted line: 4 colors + g magnitude. Red solid
line: 4 colors + g magnitude + errors on the five (u, g, r, i and z) magnitudes. The curves are obtained by varying the cut value, $y_{NN}^{min}$,
for the two distributions of Fig. 4a. Efficiency is defined as the ratio of the number of objects with a NN output greater than $y_{NN}^{min}$
over the number of objects in the sample. The dots correspond, from left to right, to $y_{NN}^{min}$ equal to, respectively, 0.2, 0.5, 0.8, 0.9,
0.95 and 0.98.
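The stopping rule described above can be sketched as a generic loop. The callables `step` and `control_error`, the `patience` parameter and the made-up error sequence are assumptions for this sketch. A real training would also keep the weights of the best epoch, which is omitted here.

```python
from typing import Callable, Tuple


def train_with_early_stopping(step: Callable[[], None],
                              control_error: Callable[[], float],
                              max_epochs: int = 1000,
                              patience: int = 1) -> Tuple[int, float]:
    """Run training epochs until the control-sample error stops decreasing.

    `step` performs one minimization iteration on the training sample;
    `control_error` evaluates the error E on the control sample.
    Returns the epoch with the lowest control error and that error.
    """
    best_error = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch in range(max_epochs):
        step()
        err = control_error()
        if err < best_error:
            best_error, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # control error no longer decreasing
    return best_epoch, best_error


# Made-up control errors: they decrease, then rise at the fourth epoch.
fake_errors = iter([0.50, 0.40, 0.35, 0.36, 0.30])
best_epoch, best_err = train_with_early_stopping(
    lambda: None, lambda: next(fake_errors))
```

With `patience=1` the loop stops at the first epoch where the control error rises, so the later value 0.30 is never reached. A larger `patience` tolerates such fluctuations at the cost of more iterations.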

The result of the NN training procedure is shown in Fig. 4a.
The histograms of $y_{NN}$ for the control QSO and background
samples are overplotted. Most objects have either $y_{NN} \sim 1$
(corresponding to QSOs) or $y_{NN} \sim 0$ (corresponding to background
objects). QSO target selection is achieved by defining a threshold
value $y_{NN}^{min}$ to be chosen between $y_{NN} = 1$ and $y_{NN} \sim 0$.
The optimal value of the threshold is obtained by balancing the
number of accepted QSOs against the number of accepted background
objects. A plot of the QSO efficiency vs. the background
efficiency is shown in Fig. 4b.
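An efficiency curve of this kind can be computed by scanning the threshold over the NN outputs of the two control samples, with efficiency defined as the fraction of objects whose NN output exceeds the cut. The function names and the NN outputs in this sketch are made up for illustration.

```python
from typing import List, Sequence, Tuple


def efficiency(outputs: Sequence[float], cut: float) -> float:
    """Fraction of objects with NN output strictly greater than the cut."""
    return sum(1 for y in outputs if y > cut) / len(outputs)


def efficiency_curve(qso_out: Sequence[float], plo_out: Sequence[float],
                     cuts: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (cut, QSO efficiency, PLO efficiency) for each cut value."""
    return [(c, efficiency(qso_out, c), efficiency(plo_out, c)) for c in cuts]


# Made-up NN outputs for small control samples (illustrative only).
qso_out = [0.95, 0.90, 0.70, 0.30]
plo_out = [0.05, 0.10, 0.02, 0.60, 0.01]
curve = efficiency_curve(qso_out, plo_out, [0.2, 0.5, 0.8])
```

Raising the cut moves along the curve towards lower QSO efficiency and lower PLO efficiency. The PLO rejection quoted in the text is one minus the PLO efficiency.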

4. Photometric Selection of Quasars

For illustration, we have considered three NN configurations that
differ by the number of discriminating variables. The first one
uses only the four standard colors (u - g, g - r, r - i, i - z). In
the second configuration, we add the absolute magnitude g and
finally in the third configuration, the errors on the five PSF
magnitudes are also taken into account. For each configuration, we
have optimized the number of neurons in the hidden layers and
the number of iterations in the minimization to get the best "PLO
efficiency-QSO efficiency" curve. The three curves are superimposed
in Fig. 4b. Adding information, i.e. discriminating variables,
clearly improves the classification performance. For instance,
for a QSO efficiency of 50%, the PLO rejection fraction
increases from 98.8% to 99.4% and to 99.6% when the number
of variables increases respectively from 4 to 5 and to 10. In the
region of QSO efficiency in which we want to work, between
50% and 80%, the PLO background is reduced by a factor 3 by
adding 6 variables to the four usual colors. The small improvement
found by using photometric errors may be due to a small
contamination of the PLO catalog by compact galaxies.

It is therefore apparent that the 10-variable NN should be 
used for the purpose of selecting quasars in any photometric 
catalog. In that case, the PLO rejection factors are respectively, 
99.6%, 99.2% and 98.5% for QSO efficiencies of 50%, 70% and 
85%. 

According to the McDonald & Eisenstein (2007) computation
based on the Jiang et al. (2006) survey of faint QSOs, we
expect ~20 QSOs per deg$^2$, with g < 22 and 2.2 < z < 3.5.
For a galactic latitude b ~ 45°, the number of objects selected
in the SDSS-DR7 imaging database is ~4000 per deg$^2$. Thus, with a QSO
efficiency of 70% and a PLO efficiency$^2$ of 0.8%, we will select
32 objects per deg$^2$ including ~14 "true" QSOs. These numbers
correspond roughly to what is required for the BOSS project.
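The target-density estimate can be checked with a few lines of arithmetic. The sketch reads the ~4000 selected objects as a density per square degree, an interpretation that reproduces the quoted 32 targets. The variable names are chosen for this example.

```python
# Densities per square degree, taken from the text.
plo_density = 4000.0      # point-like objects in the imaging database
qso_density = 20.0        # expected QSOs with g < 22 and 2.2 < z < 3.5

# Working point of the 10-variable NN.
qso_efficiency = 0.70
plo_efficiency = 0.008    # 0.8%, i.e. a PLO rejection of 99.2%

# The PLO sample already contains the QSOs, so the target density
# is simply the PLO density times the PLO efficiency.
targets_per_deg2 = plo_density * plo_efficiency
true_qsos_per_deg2 = qso_density * qso_efficiency
purity = true_qsos_per_deg2 / targets_per_deg2

print(f"targets:   {targets_per_deg2:.0f} per deg^2")
print(f"true QSOs: {true_qsos_per_deg2:.0f} per deg^2")
print(f"purity:    {purity:.2f}")
```

Under these numbers a little under half of the selected targets would be genuine QSOs in the useful redshift range.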

5. Photometric Redshift of Quasars

For the BOSS project, only quasars with a redshift in the range 
2.2 < z < 3.5 are useful. In the definition of the training sam- 
ple, we have already applied a cut on the redshift, z > 1.8, to 
reinforce the selection of high-z QSOs. But it is useful to add 
an additional constraint and select only QSOs with u - g > 0.4.
This a-posteriori color cut helps to remove QSOs in the region 
0.8 < z < 2.2. However, we propose a more elegant method 
which consists of estimating the redshift of the QSO from the 
photometric information with another NN. 



2 Note that by its definition in Sec. 2.1, the PLO sample contains
QSOs.

Fig. 5. a) Photometric redshift determined with the NN ($z_{NN}$) as a function of the redshift measured from spectroscopy ($z_{spectro}$).
b) The $z_{NN} - z_{spectro}$ distribution is fitted with three Gaussians contributing 93.4%, 6.4% and 0.2% of the histogram and of width,
respectively, σ = 0.1, 0.4 and 1.0. The RMS of the $z_{NN} - z_{spectro}$ distribution is 0.18 and its mean is 0.00.



For the determination of the photometric redshift we use the
same 10 variables as those in the NN for target selection. The
difference is that in the definition of the error E, in Eq. 3, the
targeted value y(p) is a continuous value equal to the true
spectrometric redshift, $z_{spectro}$. Except for this difference, the NN
architecture is the same as for target selection with two hidden layers
with the same number of hidden neurons. The minimization is
computed with a single "training" sample of spectroscopically-confirmed
QSOs and it is suspended as soon as the error E for
the QSO "control" sample stops decreasing.

Fig. 5a shows the photometric redshift $z_{NN}$, determined
with the NN, versus the spectroscopic redshift of the
spectroscopically-confirmed QSOs. Most of the objects are distributed
along the diagonal, demonstrating the good agreement
between the two measurements. This can be quantified by plotting
the difference $z_{NN} - z_{spectro}$ (Fig. 5b). The fit of this
distribution with three Gaussians gives 93.4% and 6.4% of the objects
in the core and wide Gaussians, of width σ = 0.1 and 0.4
respectively. The fraction of outliers, determined with the third
Gaussian (σ = 1), is only 0.2%.
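As a rough cross-check, the overall spread implied by the three-Gaussian fit can be computed from the quoted fractions and widths. The sketch assumes that all three components are centered on zero, which the text does not state. The result describes the fitted model only and need not equal an RMS measured directly on the histogram, since that depends on the tails and binning of the data.

```python
import math

# Fractions and widths of the three fitted Gaussians, from the text.
fractions = [0.934, 0.064, 0.002]
widths = [0.1, 0.4, 1.0]

# For a zero-mean mixture, the variance is the weighted sum of the
# component variances (assumption: all components centered on zero).
variance = sum(f * s ** 2 for f, s in zip(fractions, widths))
mixture_rms = math.sqrt(variance)

print(f"mixture RMS = {mixture_rms:.3f}")
```

Under this assumption the mixture RMS comes out close to 0.15, with the core and the wide component contributing comparable amounts to the variance despite their very different fractions.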

Therefore, as shown in Fig. 6, by applying a conservative cut
on the photometric redshift, $z_{NN} > 2.1$, we can remove 90.0% of
the QSOs with z < 2.2. The fraction of lost QSOs with a redshift
in the relevant region, 2.2 < z < 3.5, stays at a reasonable level
of 5.3%.
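The two fractions quoted for the photometric-redshift cut can be evaluated on any sample with both redshifts available, along the following lines. The function name `cut_summary` and the short redshift lists are made up for this sketch and are not taken from the QSO sample.

```python
from typing import Sequence, Tuple


def cut_summary(z_spec: Sequence[float], z_nn: Sequence[float],
                cut: float = 2.1, lo: float = 2.2,
                hi: float = 3.5) -> Tuple[float, float]:
    """Return (fraction of z < lo QSOs removed, fraction of lo < z < hi QSOs lost).

    An object passes the selection when its photometric redshift exceeds `cut`.
    """
    low = [zn for zs, zn in zip(z_spec, z_nn) if zs < lo]
    signal = [zn for zs, zn in zip(z_spec, z_nn) if lo < zs < hi]
    removed_low = sum(1 for zn in low if zn <= cut) / len(low)
    lost_signal = sum(1 for zn in signal if zn <= cut) / len(signal)
    return removed_low, lost_signal


# Made-up spectroscopic and photometric redshifts (illustrative only).
z_spec = [1.9, 2.0, 2.1, 2.5, 3.0, 2.3, 2.8]
z_nn = [1.8, 2.15, 2.0, 2.4, 3.1, 2.05, 2.7]
removed_low, lost_signal = cut_summary(z_spec, z_nn)
```

Placing the cut slightly below the lower edge of the useful range (2.1 rather than 2.2) trades a little low-redshift contamination for a smaller loss of useful QSOs.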



6. Conclusions 

In this paper we have presented a promising new approach to select
quasars from photometric catalogs and to estimate their redshift.
We use a Neural Network with a multilayer perceptron
architecture. The input variables are photometric measurements,
i.e. the magnitudes and their errors for the five bands (ugriz) of
the SDSS photometry.

For the target selection, we achieve a PLO rejection factor of
99.6% and 98.5% for, respectively, a quasar efficiency of 50%
and 85%. The rms of the difference between the photometric
redshift and the spectroscopic redshift is of the order of 0.15
over the region relevant for BAO studies. These new statistical
methods developed in the context of the BOSS project can easily
be extended to any other analysis requiring QSO selection and/or
determination of their photometric redshift.

Fig. 6. Spectrometric redshift distribution in the QSO sample
(blue slashed histogram). The distribution for the QSOs passing
the cut $z_{NN} > 2.1$ is overplotted (red dotted histogram). After
this cut, 90.0% of the QSOs with z < 2.2 are removed and only
5.3% of the QSOs in the 2.2 < z < 3.5 region are lost.